vision features to accurately understand and interpret the meaning of the signer or vice versa. In the current study, we automatically distinguish hand signs and classify seven basic gestures representing symbolic emotions or expressions: happy, sad, neutral, disgust, scared, anger, and surprise. The Convolutional Neural Network (CNN) is a well-known method for vision-based classification with deep learning; in the current study, we propose transfer learning with the well-known VGG16 architecture, using pre-trained weights to speed up convergence and improve accuracy. The proposed architecture obtained a high accuracy of 99.98% with a minimal and low-quality dataset of 455 images collected from 65 individuals for seven hand gesture classes. Further, we compared the performance of the VGG16 architecture under two different optimizers, SGD and Adam, along with the AlexNet, LeNet05, and ResNet50 architectures.
Keywords: Pakistan Sign Language, Hand Gestures, Convolutional Neural Network (CNN), 
VGG16, Transfer Learning. 
INTRODUCTION


It is estimated that around 480 million people worldwide have hearing loss [1]. Sign Language (SL) is the way to reduce the gap between deaf and hearing people. However, many people with hearing loss who know sign language still depend on translators or interpreters, because many hearing people are unaware of sign language. In this context, any system that automatically translates sign language is beneficial, improving mute people’s lives by increasing social inclusion and freedom of self-expression [2].

Consequently, developing technologies are reducing the hurdles faced by deaf and mute people, but sign language remains a challenging problem domain [3]. One of several problems is the variety of sign languages, just as with spoken languages: there is no global sign language. Therefore, to the best of our knowledge, a generic sign language interpreter cannot be applied to lessen the gap between mute or hard-of-hearing people and everyday individuals [4]. American Sign Language (ASL), Chinese Sign Language (CSL), Indian Sign Language (ISL), British Sign Language (BSL), etc., are all different. Likewise, Pakistan Sign Language (PSL) has its own identity as Urdu Sign Language (USL). The research and development space for Pakistan Sign Language is enormous; compared to other languages, PSL needs special attention to automate the capture of users' physical characteristics and interpret them into meaningful text or conversation [5].

Understanding emotions in human gestures is essential to understanding their subjective, physiological, and expressive components. These three elements play an essential role in understanding an individual's emotional response: how the individual experiences an emotion (the subjective component), and the body’s reaction and the resulting behavior, which are called the physiological and expressive components, respectively [6].

With the development and growth of Machine Learning (ML), many applications benefit from interpreting sign language. Among the several architectures of Machine Learning, the Neural Network (NN) is one of the most commonly used [7], [8]. An Artificial Neural Network (ANN) simulates the behavior of biological neurons; it is built up of several layers of artificial neurons. Different features are detected by several specialized and dedicated layers of neurons. Since the signs in sign language are varied and consist of different patterns like hand posture, shifting, and movement, using an artificial neural network helps develop systems that detect and interpret signs [9].

Recently, Convolutional Neural Networks (CNN) have been growing fast in vision-based deep learning. Generally, when the weights are initialized randomly, model training with a CNN takes a long time [10]. Transfer learning can shorten the training time by using the weights of a previously trained architecture as initial weights. The new architecture uses the new dataset, with layers exactly like those of the previous architecture so that the weights can be transferred. The transferred weights can be maintained or changed during the training of the new architecture [11].

In the current research, we investigate Convolutional Neural Networks (CNN) and Transfer Learning and how effective these techniques are at extracting hand features and interpreting hand signs of PSL. We used the Visual Geometry Group (VGG) models with two different optimizers: Stochastic Gradient Descent (SGD) and Adam [12]. Among the VGG models, we selected VGG16 with an architecture pre-trained on the ImageNet data, which consists of 1.2 million color images in 1000 classes [13]. By adapting the pre-trained model with the Adam optimizer, we achieved very good performance with minimal and low-quality data: 455 images in total from 65 individuals. Further, we analyzed the effect of the VGG16 transfer learning architecture using the SGD and Adam optimizers. In addition, other CNN-based architectures, AlexNet, ResNet-50, and LeNet-05, were also tested; none of them was as successful for our dataset, but each provided a competent architecture for hand sign classification. The current paper contributes in the following ways:

1. A CNN that interprets emotions through hand signs using Pakistan Sign Language.

2. A publicly available dataset for Pakistan Sign Language hand signs representing emotions.

3. An approach based on transfer learning that adapts the model to reach maximum accuracy with a minimal and low-quality dataset.

4. A comparison of the AlexNet, LeNet05, ResNet50, and VGG16 architectures.

5. An evaluation of the VGG16 architecture with two different optimizers, SGD and Adam.

6. A comparison of our model with existing state-of-the-art models.

The current Section 01 covers the domain introduction and the motivation of the present paper. The rest of the paper is structured as follows: Section 02 presents the background of Transfer Learning along with some deep learning architectures. Section 03 describes the methodology, where the dataset and the adapted architecture are discussed. In Section 04, the results are presented and evaluated. Finally, Section 05 provides the conclusion and future directions.
BACKGROUND 

In Sign Language, hand gesture and posture play a vital role in understanding [14]. The recognition of hand gestures is the technique of analyzing a pattern in images. It generally requires four steps to recognize any motion: (1) image acquisition, which retrieves unprocessed labeled images from a source; (2) image pre-processing, which applies techniques ranging from grayscale conversion to several others to represent images as matrices of pixels; (3) feature extraction, in which image patterns are analyzed and extracted; and (4) image classification, in which new or unseen images are classified among predefined classes [15]. Several machine learning-based hand gesture recognition methods are in use [16]-[18]. However, an encouraging line of techniques in use nowadays is based on Deep Learning and, more specifically, built on CNNs [19]. A CNN is particularly suitable for image recognition because it automates the feature extraction process and avoids poor generalization; it also decreases the bias found in traditional algorithms [20]. A CNN takes a long time to converge when starting from random weights. Transfer learning is a solution to this issue, offering better accuracy and a shorter training time [21]. In the current study, we have used the pre-trained VGG16 model and proposed a model with a new dataset of hand gestures. The proposed architecture demonstrated the correct classification of all training and validation data. Table 01 describes the literature summary.
Table 1: Literature Summary

Year [Ref.] | Model / Approach | Sign Language | Signs | Accuracy | Description
2018 [20] | Restricted Boltzmann Machine | American Sign Language | Finger-spelling [illegible] | [illegible] | Model shows difficulty recognizing characters with low visual inter-class variability.
2019 [19] | Convolutional Neural Network | [illegible] Sign Language | [illegible] spelling numbers | 92% | [illegible] improved by hand structuring devices and it will increase accuracy.
2020 [17] | Discrete Wavelet Transform (DWT) and Support Vector Machine (SVM) | [illegible] Sign Language | four [illegible] | 96.5% | Scope of the study and dataset is very low.
2021 [illegible] | Deep Convolutional [illegible] | American Sign Language | All English alphabets | 96.2% | A similar gesture usually occurs as a misclassification.
2022 [18] | Deep Convolutional Neural Network | [illegible] Sign Language | Finger-spelling for numbers | 91.41% | An old dataset of numbers was used. The study also has limitations in pose and variance.


Existing CNN architectures 

The Convolutional Neural Network (CNN) is renowned for its robust feature extraction and classification capabilities. Various architectures have been introduced that share a constant baseline learning mechanism for object recognition while improving the deep learning mechanism [22]. Among the several famous architectures are AlexNet, LeNet-05, ResNet-50, and VGG16. Krizhevsky et al. proposed the AlexNet architecture, which is the first object recognition model that tries to learn network parameters over large-scale databases. The AlexNet architecture has 26 layers and can easily be categorized into three sections, as shown in Figure 01. The first part consists of 2 units, each comprising convolution, ReLU (activation function), normalization, and pooling layers. The second part of the architecture includes four units consisting of 2 layers each: convolutional and pooling. The last section of the architecture is based on the non-linear activation unit, together with the corresponding Fully Connected (FC), ReLU, and drop-out layers. A drop-out layer is used to avoid overfitting during training, and the last two stages of AlexNet are Softmax and Output for seven classes [23].

The accuracy of a CNN architecture depends, for the most part, on a vast dataset, high-end computational systems, and the depth of the network [24]. The last factor is uncertain for the AlexNet architecture, because no measure is available that could bound the network’s depth.

Another convolutional architecture, LeNet-05 [25], was proposed by LeCun, initially for handwritten character recognition. This CNN-based model is capable of taking multiple objects as input and producing multiple outputs in a single pass without prior segmentation, which is called a Space Displacement Neural Network (SDNN) [26]. Sparse convolutional layers and max-pooling layers are the core of the LeNet architecture. Details of the model are depicted in Figure 02. From the implementation point of view, the lower layers of the architecture work on 4D tensors, and a subsequent flatten layer converts the feature maps into a 2D matrix.
Figure 1: AlexNet Architecture
Figure 2: LeNet-5 Architecture
Another powerful member of the CNN family is the ResNet50 architecture [27]. This model consists of 48 convolutional layers along with one max-pooling layer and one average-pooling layer. Figure 03 shows the model details. The essence of this model is a deep residual learning framework; it is specifically intended to solve the vanishing gradient problem in extremely deep neural networks. This architecture is eminent for many reasons; in particular, despite its 50 layers, it has only around 23 million trainable parameters, which makes it a much smaller network than other existing CNN architectures.
Figure 3: ResNet50 architecture

In the current study, we propose an architecture based on VGG16, a base CNN model for object detection, which is described in the next section. Unlike in other networks, the Convolutional, ReLU, and Pooling layers are replicated in the deeper VGG16 network. Its authors also chose a smaller receptive window for each convolutional layer [28].
METHODOLOGY 
Hand gestures dataset 

The dataset used in this research was collected with an OPPO A76 mobile device with dual cameras (13 MP, f/2.2) and an LED flash. The selection of the gestures is based on the seven basic expressions: sad, happy, neutral, disgust, scared, angry, and surprise. To convey these seven basic expressions through the hands, we selected seven adjectives from Pakistan Sign Language, as follows: the disgust feeling is expressed by the adjective bad, the neutral feeling by the adjective best, the happy feeling by the adjective glad, the sad feeling by the adjective sad, the scared expression by the adjective scared, the angry feeling by the adjective stiff, and the surprise expression by the same adjective, surprise, in PSL. Figure 04(a) depicts the seven basic expressions by hands defined in Pakistan Sign Language.


Figure 4(a): Seven hand gestures elaborating seven basic expressions using hands in PSL: a. Bad, b. Best, c. Glad, d. Sad, e. Scared, f. Stiff, g. Surprise
Figure 4(b): Histograms of seven hand gestures elaborating seven basic expressions in PSL

All individuals, male and female, are from 20 to 50 years old. Each participant was asked to keep a frontal view and to keep the hand posture natural and in line with their body posture. Each participant's performance was reviewed on the spot to ensure suitable quality. Attire (shirt) and background were not controlled, and there are minor fluctuations in the subjects’ position and orientation. All other aspects of data collection
were controlled. In the overall dataset, 48% of individuals used the right hand for gesture representation and the remaining 52% used the left hand. The data can therefore be considered balanced with respect to handedness, and flipping is applied so that the use of the right or left hand does not cause misclassification after model training and system completion.

As a preprocessing stage, all images were resized to 312x306 after acquisition. Figure 04(b) depicts the histogram representation of all hand gestures expressing a specific emotion. Further, the dataset was split in a 75:25 ratio into training and test data.
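
As a rough illustration of the 75:25 split, the following Python sketch divides a list of labeled image paths class by class. The file names, the fixed random seed, the stratified (per-class) strategy, and the layout of one image per participant per gesture are all assumptions made for this example; the paper only states the 455-image total, the 65 participants, the seven classes, and the ratio.

```python
import random
from collections import defaultdict

CLASSES = ["bad", "best", "glad", "sad", "scared", "stiff", "surprise"]


def stratified_split(samples, train_fraction=0.75, seed=0):
    """Split (path, label) pairs so every class keeps the same train/test ratio."""
    by_label = defaultdict(list)
    for path, label in samples:
        by_label[label].append(path)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label in sorted(by_label):
        paths = sorted(by_label[label])
        rng.shuffle(paths)
        cut = round(train_fraction * len(paths))
        train += [(p, label) for p in paths[:cut]]
        test += [(p, label) for p in paths[cut:]]
    return train, test


# Hypothetical file layout: one image per participant (65) per gesture (7).
samples = [(f"{label}/participant_{i:02d}.jpg", label)
           for label in CLASSES for i in range(65)]

train, test = stratified_split(samples)
print(len(samples), len(train), len(test))
```

Splitting per class keeps the seven gestures equally represented in both subsets, which matters for a dataset this small.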
Convolutional Neural Network (CNN) 

We cannot overlook neural networks when it comes to image recognition or vision-based machine learning. The backpropagation method is used to perform CNN training in certain specific layers. The convolutional layer is one of the hidden layers in the model [29]. Overall, the layers extract features and later classify those features at the end. The data is treated as a tensor, or high-order matrix, while moving through the layers; in addition, certain layers store training weights, which are called parameters. A CNN has the following general layers [30]:

Input Layer 

The input layer takes images as input directly. It is a high-order matrix storing the dimensions of the image: length, width, number of channels, and transformations of the input image.

Convolutional Layer 

The convolutional layer mainly applies the convolution process and stores the weights resulting from training. This layer provides a feature map with a smaller length and width but more depth. The number of training weights (parameters) can be calculated using Equation (1):

$$N_{param} = (k_1 \cdot k_2 \cdot N_{input} \cdot N_{output} + N_{bias}) \tag{1}$$

In the above equation, $k_1$ and $k_2$ are the kernel sizes, $N_{input}$ and $N_{output}$ are the numbers of input and output filters respectively, and $N_{bias}$ is the number of biases, which is the same as the number of output filters $N_{output}$.
Activation Layer
This layer applies the activation function for the convolutional layer. The Rectified Linear Unit (ReLU) is the usual choice due to its lightweight nature: it changes only negative input values to zero, while positive inputs remain unchanged.
Pooling Layer
This layer retrieves the most important features of the input tensor by subsampling the output of the previous layer. The kernel size determines the output size of the tensor. This layer works as a regularizer for the model. After downsizing, the new matrix is a pooled feature map, which is passed to the next layer.
Fully Connected Layers 
This is the last layer of the model, and it acts as a classifier. Once the overall model has learned all the features, the final step is to flatten the final pooled features by converting the feature map matrix into a single-column matrix. The artificial neurons in this layer are trained and store the resulting weights. The softmax activation function is used to decide the most dominant class. The number of parameters is computed as for a convolutional layer with kernel size 1, as given in Equation (2):

$$N_{param} = (N_{input} \cdot N_{output} + N_{bias}) \tag{2}$$
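
Equations (1) and (2) can be checked with a few lines of Python. The sketch below is only illustrative: the function names are ours, and the layer sizes (3x3 kernels, 3 input channels, 64 filters, and a 4096-unit layer feeding seven outputs) are assumed from the standard VGG16 layout rather than taken from the text above.

```python
def conv_params(k1, k2, n_input, n_output):
    """Equation (1): all kernel weights plus one bias per output filter."""
    n_bias = n_output
    return k1 * k2 * n_input * n_output + n_bias


def fc_params(n_input, n_output):
    """Equation (2): a fully connected layer is the kernel-size-1 case of Equation (1)."""
    return conv_params(1, 1, n_input, n_output)


# Assumed example: first VGG16 convolution (3x3 kernel, RGB input, 64 filters).
print(conv_params(3, 3, 3, 64))   # 1792
# Assumed example: a 4096-unit layer connected to a seven-class output layer.
print(fc_params(4096, 7))         # 28679
```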

Figure 05. General architecture of CNN 
Figure 5 describes the general architecture of the Convolutional Neural Network (CNN). Here, the input image is split into many smaller matrices and provided as input to the convolutional layer; after the convolution and the ReLU function, the pooling process is performed. These two processes are repeated several times to make the network deeper. Finally, after feature learning, the classification section consists of flattening and fully connected layers that provide the classes as output in the final layers.
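
The pipeline described above (convolution, ReLU, pooling, flattening, fully connected layer, softmax) can be traced on a toy input. The following pure-Python sketch is not the model used in this study; the 5x5 image, the 2x2 filter, and the two-class weights are made-up values chosen so that each step can be followed by hand.

```python
import math


def conv2d(image, kernel):
    """Valid cross-correlation, which is what CNN 'convolution' layers compute."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def relu(fmap):
    """Set negative activations to zero and keep positive ones."""
    return [[max(0, v) for v in row] for row in fmap]


def max_pool_2x2(fmap):
    """Keep the maximum of each non-overlapping 2x2 window."""
    return [[max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
             for c in range(0, len(fmap[0]) - 1, 2)]
            for r in range(0, len(fmap) - 1, 2)]


def flatten(fmap):
    """Turn the pooled feature map into a single vector."""
    return [v for row in fmap for v in row]


def dense(vector, weights, biases):
    """Fully connected layer: one weighted sum per output neuron."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, biases)]


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


image = [[0, 0, 1, 1, 0] for _ in range(5)]  # toy image with a bright vertical bar
kernel = [[-1, 1], [-1, 1]]                  # made-up filter for dark-to-bright edges

features = max_pool_2x2(relu(conv2d(image, kernel)))
flat = flatten(features)

weights = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]  # made-up two-class weights
biases = [0.0, 0.0]
probs = softmax(dense(flat, weights, biases))
print(features, flat, probs)
```

In this toy case the filter responds at the left edge of the bar, ReLU removes the negative response at the right edge, and pooling halves each spatial dimension before classification.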
VGG-16 architecture using Transfer Learning 

In this paper, we used the VGG16 model architecture. The original model is pre-trained on 1.2 million RGB images for 1000 classes [31]. The basic VGG16 model has 16 weight layers: 13 convolutional layers using 3x3 kernels with the ReLU activation function, followed by 3 fully connected (FC) layers that act as the classifier. In the architecture, each block of convolutional layers is followed by a max-pooling layer whose kernel size is always 2x2. The function of the convolutional layers is to automate the task of feature extraction and to store the training weights. The number of parameters is determined collectively by the convolutional layers and the fully connected layers. Figure 06 shows the output tensor size and number of parameters of each layer; the first 19 layers extract features, and layers 20 to 23 perform the task of classification. The total number of parameters of VGG16 is very high at 134,289,223.
Figure 6. VGG16 transfer learning model
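
The parameter total of 134,289,223 can be reproduced from Equations (1) and (2). The following sketch assumes the standard VGG16 layout (13 convolutional layers with 3x3 kernels, five 2x2 max-pooling layers, and two 4096-unit FC layers before the output layer) and a 224x224 RGB input; neither the helper names nor the input size come from the paper. Under these assumptions, a seven-output final layer gives exactly 134,289,223 parameters, whereas the original 1000-class model has 138,357,544.

```python
def conv_params(k, n_in, n_out):
    """Equation (1) for a square k x k kernel."""
    return k * k * n_in * n_out + n_out


def fc_params(n_in, n_out):
    """Equation (2)."""
    return n_in * n_out + n_out


# Output filters of the 13 convolutional layers; "M" marks a 2x2 max-pooling layer.
VGG16_LAYOUT = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
                512, 512, 512, "M", 512, 512, 512, "M"]


def vgg16_params(num_classes, input_size=224, input_channels=3):
    """Total parameter count of VGG16 with a num_classes-way output layer."""
    total, channels, size = 0, input_channels, input_size
    for item in VGG16_LAYOUT:
        if item == "M":
            size //= 2            # pooling halves height and width, adds no weights
        else:
            total += conv_params(3, channels, item)
            channels = item
    flat = size * size * channels  # 7 * 7 * 512 = 25088 for a 224x224 input
    for n_out in (4096, 4096, num_classes):
        total += fc_params(flat, n_out)
        flat = n_out
    return total


print(vgg16_params(1000))  # 138357544
print(vgg16_params(7))     # 134289223
```

Almost all of the parameters sit in the first fully connected layer (25088 x 4096 weights), which is why replacing only the output layer changes the total so little.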

Proposed Architecture 

The current study is based on a pre-trained VGG16 model. Our model consists of two sections: (1) a feature extractor and (2) a classifier. The first section is used to extract features from the new dataset of 7 hand gestures representing symbolic expressions in sign language. Transfer learning is carried out in the feature-extraction section, where the weights have already converged. In the classification section, the final layer is replaced with a new FC layer to handle the seven classes of the new dataset. In transfer learning, the learning rate and the number of epochs are usually kept very small, as most of the parameters have already converged. In the current study, we used 25 epochs with a learning rate of 0.0001. Figure 07 shows the difference between the original VGG16 architecture and the newly proposed architecture adapted to the PSL hand expression dataset.


Figure 07: (a) original VGG16 model: VGG16 feature extraction followed by FC6, FC7, and FC8, with 1000 outputs; (b) proposed model: VGG16 feature extraction followed by FC6, FC7, and New-FC, with 7 outputs
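
The learning rate of 0.0001 enters the two optimizers compared in this study in different ways. The sketch below shows a single update of one scalar weight under plain SGD and under Adam. It is a simplified illustration, not the training code of this study: the weight and gradient values are made up, and the Adam hyperparameters (beta1 = 0.9, beta2 = 0.999, eps = 1e-8) are the commonly used defaults, assumed here because the paper does not report them.

```python
import math

LEARNING_RATE = 0.0001  # value reported in the paper


def sgd_step(theta, grad, lr=LEARNING_RATE):
    """Plain SGD: move against the gradient, scaled by the learning rate."""
    return theta - lr * grad


def adam_step(theta, grad, state, lr=LEARNING_RATE,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; beta1, beta2 and eps are assumed default values."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad        # first moment
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2   # second moment
    m_hat = state["m"] / (1 - beta1 ** state["t"])              # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps)


# Made-up scalar weight of 1.0 and two gradients of very different magnitude.
for grad in (10.0, 0.01):
    sgd_move = 1.0 - sgd_step(1.0, grad)
    adam_move = 1.0 - adam_step(1.0, grad, {"t": 0, "m": 0.0, "v": 0.0})
    print(f"grad={grad}: SGD moves {sgd_move:.2e}, Adam moves {adam_move:.2e}")
```

The SGD step is proportional to the gradient, while the first Adam step has a size close to the learning rate regardless of the gradient scale. This per-parameter normalization is one plausible reason for the different behavior of the two optimizers at the same learning rate, although the paper itself does not analyze the cause.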

RESULTS AND DISCUSSIONS 

To analyze the test scenario for our fine-tuned VGG16 architecture with the Adam optimizer, results are represented graphically to make them clearer. More specifically, (a) and (b) in figure 08 show the accuracy and loss graphs, in which the validation curves follow the training curves very well and reach high accuracy within only 25 epochs. Besides the accuracy and loss metrics, we have also used class-wise average Precision, Recall, and F1-score, and the Area Under the Curve (AUC), along with the confusion matrix.
Accuracy is the ratio of correct predictions to total observations. Our accuracy is near 100 percent, which shows that our model suits our case well, but to rule out any irregularity, we must check other metrics to justify the performance of the model. Equations (3) to (6) give the formulas for all evaluation metrics.


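$$\text{Accuracy} = \frac{\text{True Positive} + \text{True Negative}}{\text{True Positive} + \text{False Positive} + \text{False Negative} + \text{True Negative}} \tag{3}$$

$$\text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} \tag{4}$$

$$\text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} \tag{5}$$

$$\text{F1 Score} = \frac{2 \cdot (\text{Recall} \cdot \text{Precision})}{\text{Recall} + \text{Precision}} \tag{6}$$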
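
For a multi-class problem such as the seven gestures, the four formulas above are typically applied per class in a one-vs-rest manner, starting from the confusion matrix. The sketch below shows this for one class; the function name and the 3x3 confusion matrix are made up for illustration and do not come from the experiments.

```python
def per_class_metrics(confusion, k):
    """Accuracy, precision, recall and F1 for class k, treated as one-vs-rest.

    confusion[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    tp = confusion[k][k]
    fp = sum(confusion[i][k] for i in range(n)) - tp  # predicted k, actually another class
    fn = sum(confusion[k]) - tp                       # actually k, predicted another class
    tn = total - tp - fp - fn

    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * recall * precision / (recall + precision)
          if recall + precision else 0.0)
    return accuracy, precision, recall, f1


# Made-up confusion matrix for three classes with ten samples each.
confusion = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 0, 10],
]
accuracy, precision, recall, f1 = per_class_metrics(confusion, 0)
print(accuracy, precision, recall, f1)
```

Averaging the per-class values over all classes gives the kind of average scores reported per architecture.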


To evaluate the model as a classifier, we use the Area Under the Curve (AUC), which summarizes the Receiver Operating Characteristic (ROC) curve. The higher the AUC, the better the model distinguishes positive from negative samples of any class.
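
One way to read the AUC is as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative sample, with ties counted as one half. The following sketch computes the AUC through this pairwise definition for a single class treated as one-vs-rest; the labels and scores are made-up values, and the pairwise loop is chosen for clarity rather than efficiency.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Made-up example: 1 marks the class of interest, 0 marks all other classes.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
print(auc(labels, scores))  # 8 of the 9 positive/negative pairs are ranked correctly
```

A value of 0.5 corresponds to random ranking and a value of 1 to perfect separation of the two groups.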


Figure 08: (a) Accuracy graph for training and validation of the transfer learning model over 25 epochs; (b) Loss graph for training and validation of the transfer learning model over 25 epochs
Further, figure 09 presents Precision, Recall, F1 Score, and AUC. Precision is the ratio of correct positive predictions to all positive predictions, and recall is the ratio of correct positive predictions to all actual positive observations. Recall is also called the sensitivity of the model. The F1 score of the model is the harmonic mean of precision and recall, which is more informative than accuracy when classes are imbalanced. All these ratios are at the ideal value of 1 with our tuned transfer learning model.
Figure 9: (a) Precision, (b) Recall, (c) F1-Score, and (d) AUC graphs over 25 epochs
Further, in figure 09 (d), the Area Under the Curve (AUC) shows a value of 0.96 for training and 0.95 for validation. An AUC between 0.5 and 1 shows that the classifier is able to distinguish the positive classes. In our case, the results are very promising and show that the current model fits the hand gesture classification problem. Similarly, the confusion matrix (error matrix) in figure 10 shows a remarkable result of 100 percent accurate predictions.
Figure 10: Confusion Matrix / Error Matrix
Moreover, figure 10 demonstrates the class-wise prediction and testing results, which allow the robustness of the transfer learning model to be analyzed. All classes are detected with 99 to 100% validity. Given that hand orientations, backgrounds, brightness, and other features vary across the dataset, our results are quite promising. Figure 10 shows that the hand gestures of the PSL emotion-based dataset are recognized accurately.

Overall, our analysis is based on three experiments. All three experiments are carried out using a dataset of Pakistan Sign Language depicting human emotions through hand gestures. Seven basic emotions are considered: sad, scared, happy, neutral, disgust, anger, and surprise. In the first experiment, the VGG16 architecture with pre-trained weights and the Adam optimizer is used for transfer learning on this image classification task. This technique is elaborated in the current and previous sections, as the evaluation showed that it provides the best average results in terms of "Accuracy", "Loss", "Precision", "Recall", and "F1-Score". In the second experiment, the VGG16 architecture is used with all the constraints of transfer learning but with an SGD optimizer. In the third experiment, three more state-of-the-art architectures, AlexNet, LeNet05, and ResNet50, are used for our image classification task. The results of the second and third experiments are less promising. Table 02 compares the CNN architectures using accuracy, precision, recall, and F1-Score as average evaluation scores for our seven-class image recognition task. Table 02 confirms that the VGG16 architecture with the Adam optimizer is the most accurate among the architectures.

Table 2: Comparison between AlexNet, LeNet05, ResNet50, VGG16 with SGD optimizer, and VGG16 with Adam optimizer


Method          Accuracy   Precision   Recall   F1-Score
AlexNet         96.88      96.56       96.45    96.35
LeNet05         89.45      88.76       88.54    89.01
ResNet50        97.32      T7105       97.58    97.56
VGG16 (SGD)     OFZ        97.23       97.22    97.10
VGG16 (Adam)    99.98      99.98       99.98    99.89


The overall classification accuracy of our model is 100%, which is higher than that of existing work in this domain, even though our dataset is more challenging in being small and of low quality. Table 03 compares the performance of our model with other published state-of-the-art work.

Table 03: Comparison of model performances with other published work


Ref. No.    Classification Technique               Classification Accuracy
[32]        ImageNet + VGG16 trained network       94.56%
[33]        DNN + Transfer Learning                93.97%
[34]        CNN + Ensemble Model                   96%
[3]         3DCNN + LSTM                           93.09%
Our Model   CNN + VGG16 (with Adam Optimizer)      100%


We selected the deep learning approach and proposed a network because the importance of Artificial Intelligence, Machine Learning, and Deep Learning cannot be neglected in this era. Machine Learning and Deep Learning are being used in different areas such as Medical, Computer Vision, Sentiment, Natural Language Processing, and Internet of Things [35]-[40].
CONCLUSION AND FUTURE DIRECTIONS


In the current study, we used the transfer-learning-based state-of-the-art networks AlexNet, LeNet05, ResNet50, VGG16 with the SGD optimizer, and VGG16 with the Adam optimizer on a newly developed dataset for Pakistan Sign Language, consisting of seven classes of expressions represented by hand gestures. The performance of these networks is measured by accuracy, loss, Precision, Recall, F1-Score, and AUC. The performance analysis showed an improvement for the VGG16 architecture when using the Adam optimizer. In the future, these fine-tuned network architectures can be used for real-time sign language recognition systems. Additionally, sign language combines manual and non-manual gestures. Manual gestures are performed using hand signs, with either one or both hands. Non-manual signs, on the other hand, are gestures based on body movement, head movement, torso, and facial expressions. The current study targeted seven basic expressions (sad, happy, disgust, anger, neutral, scared, and surprise) expressed through the hands, but it is very important to combine facial expressions and body movements with such adjectives to enhance the meaning of these expressions using hand and facial features. So, in the future, the current architecture can be used for human action recognition, specifically for sign language action recognition, for non-manual sign language recognition, or in a hybrid model that recognizes both manual and non-manual sign language modalities.